Automatic Quantification of Serial PET/CT Images for Pediatric Hodgkin Lymphoma Patients Using a Longitudinally-Aware Segmentation Network

$\textbf{Purpose}$: Automatic quantification of longitudinal changes in PET scans for lymphoma patients has proven challenging, as residual disease in interim-therapy scans is often subtle and difficult to detect. Our goal was to develop a longitudinally-aware segmentation network (LAS-Net) that can quantify serial PET/CT images for pediatric Hodgkin lymphoma patients. $\textbf{Materials and Methods}$: This retrospective study included baseline (PET1) and interim (PET2) PET/CT images from 297 patients enrolled in two Children's Oncology Group clinical trials (AHOD1331 and AHOD0831). LAS-Net incorporates longitudinal cross-attention, allowing relevant features from PET1 to inform the analysis of PET2. Model performance was evaluated using Dice coefficients for PET1 and detection F1 scores for PET2. Additionally, we extracted and compared quantitative PET metrics, including metabolic tumor volume (MTV) and total lesion glycolysis (TLG) in PET1, as well as qPET and $\Delta$SUVmax in PET2, against physician measurements. We quantified their agreement using Spearman's $\rho$ correlations and employed bootstrap resampling for statistical analysis. $\textbf{Results}$: LAS-Net detected residual lymphoma in PET2 with an F1 score of 0.606 (precision/recall: 0.615/0.600), outperforming all comparator methods (P<0.01). For baseline segmentation, LAS-Net achieved a mean Dice score of 0.772. In PET quantification, LAS-Net's measurements of qPET, $\Delta$SUVmax, MTV and TLG were strongly correlated with physician measurements, with Spearman's $\rho$ of 0.78, 0.80, 0.93 and 0.96, respectively. The performance remained high, with a slight decrease, in an external testing cohort. $\textbf{Conclusion}$: LAS-Net demonstrated significant improvements in quantifying PET metrics across serial scans, highlighting the value of longitudinal awareness in evaluating multi-time-point imaging datasets.


Introduction
Among pediatric cancers, Hodgkin lymphoma (HL) is a highly curable malignancy (1), with 5-year survival exceeding 90% for patients receiving combination chemotherapy, radiation, or combined treatment (2). Despite this, pediatric patients face a significant risk of long-term side effects from therapeutic toxicities. Emerging evidence suggests that early responders to treatment may benefit from de-escalated therapies (3). Several clinical trials have used response assessment on interim fluorodeoxyglucose (18F-FDG) PET scans after two cycles of chemotherapy for risk stratification (2,4). Currently, PET response is assessed using visual evaluation criteria, such as the Deauville score (DS), which is based on the lesion with the most intense uptake (5). Compared to qualitative assessment, quantitative PET metrics have shown promise in guiding lymphoma treatment strategies (6,7). However, their use often relies on manual lesion segmentation, which is difficult and time-consuming, and has been limited to clinical trial settings. Deep learning (DL) algorithms have the potential to overcome this limitation and enable automatic PET analysis.
There have been extensive studies using DL to segment lymphoma (8-11) and extract quantitative metrics (12-14) in PET scans. However, existing algorithms focus on quantifying baseline tumor burden, overlooking the important role of interim PET in response assessment. Compared to baseline PET, analyzing interim PET poses significant challenges, as tumor uptake is often subtle and difficult to differentiate from confounding physiologic or inflammatory FDG activity. Physicians typically rely on cross comparison with baseline PET to identify residual lymphoma, but methods for incorporating this information into interim PET analysis remain underexplored.
In this study, we aimed to develop a longitudinally-aware segmentation network (LAS-Net) for automatic quantification of serial PET/CT images, facilitating PET-adaptive therapy for pediatric HL patients. Central to our design is a dual-branch architecture: one branch is dedicated to segmenting lymphoma in baseline PET, while the other detects residual lymphoma in interim PET. The model was trained using PET/CT images from multiple centers as part of a phase 3 clinical trial. To assess the performance of our method, we evaluated its detection performance in interim PET and its segmentation performance in baseline PET. Furthermore, we extracted various quantitative PET metrics and quantified their agreement with physician measurements. We compared LAS-Net to other methods, including those with and without the integration of baseline PET information. Lastly, we performed external testing using data from another multi-center clinical trial of pediatric HL.

Patient Cohort
This retrospective study included patients from two Children's Oncology Group (COG) clinical trials: AHOD1331 (ClinicalTrials.gov number, NCT02166463) (2) and AHOD0831 (NCT01026220) (4). Both are phase 3 trials of pediatric patients aged 2-21 diagnosed with high-risk HL. The AHOD1331 trial assessed the utility of incorporating Brentuximab Vedotin with chemotherapy, while the AHOD0831 trial evaluated the effects of combination chemotherapy together with radiation therapy. Baseline and interim FDG PET/CT images were gathered and transferred from IROC Rhode Island to our institution under data use agreements. Retrospective analysis was approved by the institutional review board with no requirement of additional consent from patients. Of the 600 patients enrolled in the AHOD1331 trial, 200 with complete PET/CT datasets were randomly selected and used as our internal cohort. Among the 166 patients from the AHOD0831 trial, 97 had complete PET/CT datasets, and these were used for external testing.

Data Labeling
For the AHOD1331 dataset, three experienced physicians with board certification in nuclear medicine (NM) provided lesion-level annotations for both baseline and interim PET using a semi-automated workflow (LesionID, MIM Software, Cleveland, Ohio), following a multi-reader adjudication process. One physician (M.S., with 5 years of experience) labeled all 200 cases, while the other physicians (S.B.P. and S.Y.C., both with over 15 years of experience) each adjudicated 100 of the cases, refining the annotations by adding, deleting, or modifying contours as necessary. The refined annotations were then reviewed and confirmed by the first reader. To assess residual tumors on interim PET scans, readers assigned each lesion a score on a 5-point scale, using the same visual evaluation criteria as the Deauville criteria (5). This score is referred to as the lesion-level Deauville score (LDS) in this study. Any residual disease scoring 3 to 5 had an associated contour. All segmented lesions were labeled according to physician confidence (non-equivocal or equivocal). An equivocal lesion is defined as a lesion for which the physician is unsure whether its PET uptake corresponds to lymphoma or is due to physiological activity. Annotators were trained using a labeling guide (described in Appendix S1).
For the AHOD0831 dataset, PET images for each patient were annotated by one of two board-certified NM physicians (J.K. and I.L., both with 5 years of experience) on Mirada XD (Oxford, UK) software as part of a prior research study (12,15). Table 1 summarizes the characteristics of these two datasets.
Table 1: Demographics and clinical characteristics of our internal and external cohorts.

LAS-Net Architecture
We designed LAS-Net with a dual-branch architecture to accommodate baseline and interim PET/CT images, as illustrated in Figure 1A. One branch exclusively processes baseline PET (PET1) and predicts the corresponding lesion masks. The other branch focuses on interim PET (PET2), but also utilizes information extracted from the PET1 branch to generate masks of residual lymphoma. This architecture enables our model to gather useful information from PET1 to inform and improve the analysis of subsequent scans. Meanwhile, it ensures a one-way information flow, preventing PET2 information from influencing PET1 analysis. Like many segmentation networks, LAS-Net was adapted from a UNet-like architecture. It is based on 3D SwinUNETR (16), a state-of-the-art (SOTA) model comprising a Swin Transformer (17) encoder and a convolutional neural network (CNN) decoder. In LAS-Net, each convolutional block is a stack of two convolution units (3×3×3 convolution sub-layers, instance normalization, leaky ReLU) with a residual connection. Beyond these components, we have introduced two critical mechanisms to allow information from the PET1 branch to influence the PET2 branch. One is the longitudinally-aware window attention (LAWA) on the encoder side, and the other is the longitudinally-aware attention gate (LAAG) on the decoder side.
Figure 1B illustrates the structure of the LAWA module. Compared to the standard Swin Transformer block (17), this module introduces a window-based multi-head cross-attention (W-MCA) layer with a window size of 7×7×7 in the PET2 branch. The W-MCA takes the query vectors from PET2 features and the key and value vectors from PET1 features. It computes the attention matrix of the query and key using scaled dot product, allowing the model to dynamically allocate focus based on the relevance of regions across PET1 and PET2. The value vectors are then reweighted by this attention matrix and added to the input PET2 features.
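The cross-attention step described above can be sketched in a few lines. This is a minimal single-head illustration in plain Python, not the LAS-Net implementation: the function names, the use of one attention head, and the omission of window partitioning, layer normalization and positional bias are all simplifying assumptions. It shows only the core idea that queries come from PET2 tokens, keys and values come from PET1 tokens, and the attended values are added back to the PET2 features.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cross_attention(pet2_tokens, pet1_tokens, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention inside one window.

    Queries are projected from PET2 tokens; keys and values from PET1 tokens.
    Returns the updated PET2 tokens and the attention matrix.
    """
    q = matmul(pet2_tokens, w_q)
    k = matmul(pet1_tokens, w_k)
    v = matmul(pet1_tokens, w_v)
    scale = math.sqrt(len(q[0]))
    # Scaled dot product between every PET2 query and every PET1 key.
    scores = [[sum(qi * ki for qi, ki in zip(q_row, k_row)) / scale for k_row in k]
              for q_row in q]
    attn = [softmax(row) for row in scores]
    update = matmul(attn, v)
    # Residual connection: reweighted PET1 values are added to PET2 features.
    out = [[x + u for x, u in zip(x_row, u_row)]
           for x_row, u_row in zip(pet2_tokens, update)]
    return out, attn
```

Because only the PET2 tokens are updated, information flows one way from PET1 to PET2, which matches the constraint stated for the architecture.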
Figure 1C presents the design of the LAAG module. Similar to the original attention gate (18), the LAAG module processes inputs from both the prior layer and skip connections, generating attention coefficients. To enable additional longitudinal awareness, we concatenate the attention coefficients derived from PET1 and PET2 and convolve them with a learnable 7×7×7 kernel to refine the PET2 attention coefficients. This CNN-based cross-attention gate allows the LAAG module to select PET2 features using information from the PET1 branch.
LAS-Net operates on 112×112×112 patches from co-registered baseline and interim PET/CT images. Except for the longitudinal cross-attention components, all weights in the model are shared between the PET1 and PET2 branches. The model was jointly optimized for PET1 and PET2 lesion segmentation using a compound loss comprising cross-entropy and Dice loss. Models were trained and evaluated through fivefold cross-validation (N=40 in each fold). Implementation details can be found in Appendices S2-3.
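As an illustration of the compound loss mentioned above, the following sketch combines a soft Dice loss with binary cross-entropy and sums the result over both branches. The equal weighting of the two terms and of the two branches, the smoothing constants, and all function names are assumptions made for this example; the actual configuration is given in the appendices and the released code.

```python
import math


def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss over flattened foreground probabilities and 0/1 labels."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clamped to avoid log(0)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def compound_loss(probs, targets):
    """Dice plus cross-entropy with equal weights (an assumed weighting)."""
    return dice_loss(probs, targets) + cross_entropy(probs, targets)


def joint_loss(pet1_probs, pet1_targets, pet2_probs, pet2_targets):
    """Sum of the PET1 and PET2 losses, mirroring joint optimization."""
    return (compound_loss(pet1_probs, pet1_targets)
            + compound_loss(pet2_probs, pet2_targets))
```

The Dice term is less sensitive to the strong class imbalance between lesion and background voxels, while the cross-entropy term provides smoother per-voxel gradients, which is a common motivation for pairing the two.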

Quantitative PET Metrics
In baseline PET analysis, we evaluated model performance using the Dice coefficient, false positive volume (FPV), and false negative volume (FNV) per patient. The quantitative metrics computed for PET1 scans (definitions in Appendix S4) included metabolic tumor volume (MTV) (19), total lesion glycolysis (TLG) (20), maximum lesion standardized uptake value (SUVmax), maximum tumor dissemination (Dmax) (21), maximum distance between the lesion and the spleen (Dspleen) (22) and the number of lesions. Since interim PET analysis primarily involves SUVmax or SUVpeak measurements (23), accurate tumor segmentation is not needed. Consequently, for PET2 scans, we evaluated our model's performance using detection F1 scores, precision, and recall. The evaluation criterion for lesion detection is defined as follows: a predicted lesion was considered a true positive if it overlapped with any true lesion identified by the physician. A true lesion without overlap was classified as a false negative. Additionally, we included a more stringent criterion, which required the SUVmax measured for the predicted lesion to match that of the true lesion; otherwise, the predicted lesion was classified as a false positive and the true lesion was classified as a false negative. Lesions detected by the model that were considered equivocal by the physicians were not counted as false positives or true positives in our evaluation. We also extracted quantitative PET2 metrics from model predictions, including SUVmax, percentage difference between baseline and interim SUVmax (∆SUVmax), qPET (23), and the number of residual lesions. Notably, ∆SUVmax and qPET have been demonstrated to have predictive potential for patient prognosis (23-25). The agreement between automated PET metrics and physician measurements was quantified by Spearman's ρ correlations.
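The overlap-based detection criterion and two of the simpler metrics can be sketched as follows. Lesions are represented here as Python sets of voxel indices, which is an illustrative choice rather than the data structure used in the study. The handling of equivocal lesions reflects one reading of the rule above (a prediction that overlaps only an equivocal lesion is ignored), and the sign convention for ∆SUVmax (a positive value means uptake decreased) is an assumption, since the formal definitions are in Appendix S4.

```python
def detection_scores(pred_lesions, true_lesions, equivocal_lesions=()):
    """Lesion-level precision, recall and F1 under the any-overlap criterion.

    Each lesion is a set of (z, y, x) voxel indices.
    """
    tp = fp = 0
    for pred in pred_lesions:
        if any(pred & true for true in true_lesions):
            tp += 1  # overlaps at least one physician-identified lesion
        elif any(pred & eq for eq in equivocal_lesions):
            continue  # equivocal match: counted as neither TP nor FP
        else:
            fp += 1
    # A true lesion with no overlapping prediction is a false negative.
    fn = sum(1 for true in true_lesions
             if not any(true & pred for pred in pred_lesions))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


def dice(pred_mask, true_mask):
    """Dice coefficient between two voxel sets (1.0 when both are empty)."""
    total = len(pred_mask) + len(true_mask)
    return 2 * len(pred_mask & true_mask) / total if total else 1.0


def delta_suvmax(suvmax_pet1, suvmax_pet2):
    """Percentage decrease in SUVmax from baseline to interim (assumed sign)."""
    return 100.0 * (suvmax_pet1 - suvmax_pet2) / suvmax_pet1
```

For example, a prediction list with one lesion touching a reference lesion and one isolated lesion, scored against two reference lesions of which one is missed, yields precision, recall and F1 of 0.5 each.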

Model Comparison
We compared the performance of LAS-Net to other models trained on our dataset, including DynUNet (26,27), SegResNet (28) and SwinUNETR (16). No longitudinal cross-attention was incorporated into these models' architectures. We also evaluated Clinical Knowledge-Driven Hybrid Transformer (CKD-Trans) (29) and Spatial-Temporal Transformer (ST-Trans) (30), both of which integrated information from PET1 into PET2 analysis using cross-attention. Notably, CKD-Trans and ST-Trans were initially developed for tumor segmentation in multiparametric MRI. Table 2 summarizes key differences among these models. Furthermore, we implemented a previous technique (15,31) that used deformable registration between PET1 and PET2 scans to reduce false positives in PET2 lesion masks. Specifically, segmentation masks predicted for PET1 are propagated to PET2 using deformable registration, and then PET2 contours that do not overlap with PET1 contours are excluded. In our work, we refer to this technique as "mask propagation through deformable registration" (MPDR). Quantitative results were reported both with and without MPDR. Additionally, we conducted ablation studies to assess the effectiveness of individual components in LAS-Net.
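The filtering step of MPDR can be illustrated with a short sketch. It assumes the deformable registration has already been applied, so that the PET1 mask is expressed in PET2 voxel coordinates; the registration itself is not shown. The function name and the set-of-voxels representation are hypothetical choices for this example.

```python
def mpdr_filter(pet2_lesions, propagated_pet1_mask):
    """Keep only PET2 lesions that overlap the propagated PET1 mask.

    pet2_lesions: list of voxel-index sets, one per predicted PET2 lesion.
    propagated_pet1_mask: voxel-index set of the PET1 prediction after it
    has been warped into PET2 space (registration assumed done elsewhere).
    """
    return [lesion for lesion in pet2_lesions if lesion & propagated_pet1_mask]
```

A consequence of this rule is that any lesion appearing only on PET2 is discarded, so genuinely new disease sites cannot survive the filter.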

Agreement of Predicted DSs and Physician-assigned DSs
DSs serve as an internationally accepted scoring system for assessing treatment response in interim PET. Two types of thresholds (DS3-DS5 positive or DS4-DS5 positive) are often used to categorize patients into adequate or inadequate response classes, depending on the clinical context (e.g., certain trials focused on treatment de-escalation may consider DS3 as an inadequate response) (32). DS for each individual patient is determined by the residual lesion with the highest uptake (5). Although our model was not trained to output patient-level DSs, we can estimate DSs by converting extracted qPET values to DSs using the qPET criterion (23). In this context, the patient-level DS was inferred from the highest-valued LDS within the patient. This indirect method allowed for a comparison of model-predicted DSs and physician-assigned DSs. The level of agreement was quantified by the F1 score and the Kappa index.
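A rough sketch of this conversion and of the agreement statistic is given below. The qPET cut-offs of 0.95, 1.3 and 2.0 are assumed here from the published qPET criterion and are exposed as a parameter so they can be changed; treating a value equal to a cut-off as the higher category, and assigning DS 2 to patients with no residual lesion, are further assumptions of this example. Because qPET does not distinguish DS1 from DS2, both are returned as 2.

```python
def qpet_to_ds(qpet, thresholds=(0.95, 1.3, 2.0)):
    """Map a qPET value to a Deauville category using assumed cut-offs."""
    ds3_cut, ds4_cut, ds5_cut = thresholds
    if qpet >= ds5_cut:
        return 5
    if qpet >= ds4_cut:
        return 4
    if qpet >= ds3_cut:
        return 3
    return 2  # DS1 and DS2 are not separated by qPET


def patient_ds(lesion_qpets):
    """Patient-level DS taken as the highest lesion-level score."""
    return max((qpet_to_ds(q) for q in lesion_qpets), default=2)


def is_inadequate(ds, positive_from=3):
    """Binarize a DS; positive_from=3 or 4 selects the clinical threshold."""
    return 1 if ds >= positive_from else 0


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two lists of binary (0/1) labels."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    pos_a = sum(labels_a) / n
    pos_b = sum(labels_b) / n
    expected = pos_a * pos_b + (1 - pos_a) * (1 - pos_b)
    if expected == 1.0:
        return 1.0  # both raters gave a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Calling `is_inadequate(patient_ds(values), positive_from=4)` switches from the DS3-DS5 positive rule to the DS4-DS5 positive rule described above.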

Statistical Analysis
The 95% confidence intervals (CIs) for our results were derived using nonparametric bootstrap resampling (33) with 10,000 repetitive trials. Baseline and interim PET/CT scans were analyzed separately in all statistical evaluations. Specifically, bootstrap resampling was performed at the patient level, meaning that patients were sampled with replacement and each patient's baseline and interim scans were included independently in their respective analyses. For each evaluation metric, the difference between LAS-Net and a comparator method was considered statistically significant at the 0.05 level if the metric values computed for LAS-Net exceeded those of the comparator method in 95% of trials.
Table 2: Characteristics of the models evaluated in this study.
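The resampling procedure can be sketched as follows. This is an illustrative version under stated assumptions: per-patient metric values are supplied as plain lists, the same resampled patient indices are used for both models in the comparison, and percentile intervals are read directly from the sorted bootstrap statistics. Function names and the fixed random seed are choices made for this example.

```python
import random


def bootstrap_ci(values, stat, n_trials=10000, seed=0):
    """Percentile 95% CI of stat over patient-level bootstrap resamples."""
    rng = random.Random(seed)
    n = len(values)
    stats = []
    for _ in range(n_trials):
        # Sample patients with replacement.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        stats.append(stat(sample))
    stats.sort()
    lower = stats[int(0.025 * n_trials)]
    upper = stats[int(0.975 * n_trials) - 1]
    return lower, upper


def fraction_better(model_values, comparator_values, stat, n_trials=10000, seed=0):
    """Fraction of resamples in which the model's statistic exceeds the comparator's."""
    rng = random.Random(seed)
    n = len(model_values)
    wins = 0
    for _ in range(n_trials):
        idx = [rng.randrange(n) for _ in range(n)]
        model_stat = stat([model_values[i] for i in idx])
        comparator_stat = stat([comparator_values[i] for i in idx])
        if model_stat > comparator_stat:
            wins += 1
    return wins / n_trials


def mean(values):
    return sum(values) / len(values)
```

Under the rule stated above, a difference would be called significant at the 0.05 level when `fraction_better(...)` returns at least 0.95.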

Data Availability
The COG clinical trial data is archived in the NCTN Data Archive. Our algorithm was implemented using the Auto3dSeg pipeline in MONAI (27). The code and models are available in the open-source project: https://github.com/xtie97/LAS-Net.
For automatic PET1 analysis (Figure 3), LAS-Net attained a mean Dice score of 0.772 (95%CI, 0.752, 0.791), with average FNV of 10.80 ml (95%CI, 8.53, 13.46) and FPV of 9.68 ml (95%CI, 7.50, 12.40) per patient. It demonstrated comparable performance to the best model, DynUNet, which had a Dice score of 0.779 (95%CI, 0.758, 0.797, P=0.32). Among the PET1 metrics extracted by LAS-Net, MTV, TLG and SUVmax exhibited high correlations with the values measured by physicians (ρ=0.93 for MTV, 0.96 for TLG, 0.90 for SUVmax). No significant differences were observed across the four evaluated models for these metrics. For the distance-based metrics, Dmax and Dspleen, LAS-Net showed […]
Scatter plots in Figure 4 visualize the agreement between PET metrics assessed by physicians and those measured by LAS-Net.

Qualitative Evaluation
Figure 5 displays images from nine sample cases, each comprising baseline and interim lesion masks predicted by LAS-Net along with physician annotations. In cases A-F, LAS-Net successfully identified the residual lesions, including the lesions with the highest PET uptake (LDS4 or LDS5) as well as those with lower uptake (LDS3). Notably, in case B, LAS-Net detected new lesions, not present on PET1, located near the neck and bladder. If MPDR was applied, these true positive lesions would be excluded, leading to an underestimation of SUVmax and qPET.
In scenarios with multiple dispersed PET2 lesions (cases G-H), LAS-Net had difficulties in accurately identifying all lesions. Additionally, LAS-Net occasionally identified false positive lesions in negative cases (case I), especially when the residual SUVs were close to the mediastinum uptake. For baseline lymphoma segmentation, LAS-Net performed consistently well at delineating bulky diseases. Nonetheless, it was less effective in detecting small lesions situated at a distance from the primary disease sites, which was true for all comparator methods. To assess the benefits of integrating longitudinal awareness into the model architecture, we compared the predictions of LAS-Net with those of DynUNet in Figure 6. Without applying MPDR, the PET2 false positives predicted by DynUNet significantly affected the accuracy of automated PET2 metrics. In case D in particular, DynUNet mistakenly identified brown fat uptake as residual lymphoma.

Agreement of Model-Extracted DS and Physician-Assigned DS
Table 3 presents patient-level DS classification results. When cases were grouped into two categories (DS 1-2 vs. DS 3-5), LAS-Net attained an F1 score of 0.752 (precision/recall: 0.687/0.836) and Cohen's kappa of 0.630, outperforming (P<0.05) the top comparator, ST-Trans (with MPDR), which had an F1 score of 0.660 and Cohen's kappa of 0.501. When cases were grouped as DS 1-3 vs. DS 4-5, LAS-Net achieved an F1 score of 0.633 (precision/recall: 0.500/0.867) and Cohen's kappa of 0.549, and was superior to other evaluated methods.

Ablation Studies
The results of ablation studies are shown in Table 4. We found that both LAWA and LAAG modules for longitudinal cross-attention improved lesion detection performance in PET2. Also, the inclusion of the PET1 branch and the combined PET1 and PET2 training enhanced the model's capability to quantify PET2 scans. The choice of registration method between PET1 and PET2 did not significantly impact model performance: when input baseline and interim PET/CT images were co-registered using rigid registration, the performance was slightly worse but not significantly different from that achieved with deformable registration (P=0.22 for F1 scores).

External Testing
We applied LAS-Net, trained on all AHOD1331 data, to the external AHOD0831 dataset. The detection F1 score in PET2 was 0.525 (95%CI, 0.456, 0.582) and the Dice score in PET1 was 0.684 (95%CI, 0.655, 0.711). Regarding quantitative PET metrics, the Spearman's ρ correlations between LAS-Net predictions and physician measurements showed a slight decrease: 0.70 for PET2 ∆SUVmax, 0.69 for PET2 qPET, 0.87 for PET1 MTV and 0.89 for PET1 TLG. Detailed results along with example cases are provided in Appendix S6.

Discussion
In this study, we introduced a novel deep-learning-based method (LAS-Net) for longitudinal analysis of serial PET/CT images in pediatric HL patients. Our approach was different from prior methods in two aspects. First, it used longitudinal cross-attention to extract baseline PET information for improved analysis of interim PET. Second, it adopted a dual-branch architecture to enable automatic quantification of both baseline and interim scans. Through comparative and ablation studies, we validated the effectiveness of our approach using data from two multi-center clinical trials, highlighting its potential to deliver rapid and consistent assessment of PET tumor burden and response.
Existing DL algorithms for detecting lymphoma lesions have been limited to analyzing PET1 scans without the ability to quantify PET2 for response assessment and outcome prediction. This limitation is primarily due to the challenge of detecting residual lymphoma in PET2, which often has low FDG uptake. The task is difficult even for expert physicians, who usually rely on PET1 (i.e., viewing PET1 and PET2 side-by-side) to identify residual lymphoma. Our method was intended to fill this gap by integrating longitudinal cross-attention mechanisms into the architecture. While previous research has leveraged prior PET data for interim image denoising (34) and response classification (35), our work distinguishes itself by incorporating longitudinal awareness to improve the analysis of multi-time-point imaging datasets. This design also aligns with the way physicians interpret PET scans over time. While our current study focuses on quantification, we believe that the principle underlying our method has broader applications. It can potentially be extended to DL models aimed at diagnosis, especially for tasks requiring analysis of prior imaging data. It can also be adapted to accommodate multi-modal inputs by adding cross-attention modules in the intermediate layers. This may further improve the model's accuracy, but challenges such as data alignment, increased computational complexity and the need for data fusion techniques would need to be addressed.
To develop a model for longitudinal response assessment in PET scans, we chose to jointly optimize our model for PET1 and PET2 analysis. This substantially improved model performance in identifying residual lesions in PET2, as evidenced by the ablation study. However, our model's PET1 segmentation performance was no better than other SOTA models trained on PET1 scans. This was expected, as the PET1 branch should not, in principle, benefit from the PET2 branch.
The quantitative PET metrics that we investigated have been demonstrated to be better than visual criteria at guiding lymphoma treatment (6,7). For PET1, we found that MTV and TLG were the metrics most accurately quantified by the DL model. They are also the most time-consuming metrics for physicians to measure. Newly proposed distance-based metrics (Dmax, Dspleen) were harder to quantify accurately, because a single false positive or false negative can have a large impact on the values of these metrics. For PET2, we focused on measuring SUVmax, qPET, and the response metric ∆SUVmax, as these have been associated with patient outcome in previous studies (23-25). Detecting residual lymphoma on PET2 was very challenging for models that did not use longitudinal information, and this was reflected in their poor F1 scores and their performance in PET2 quantification. Even with MPDR, these models were inferior to LAS-Net.
This study has several limitations. First, despite showing clear advancements, we have not yet evaluated LAS-Net's clinical utility, and there is still room for further improvements in its performance. Future work could explore advanced training strategies, such as pretraining or semi-supervised techniques, or use a larger labeled dataset. Second, we focused on quantitative PET metrics (MTV, qPET, etc.). Future research will aim to associate these metrics with patient outcome. Third, the labeling process for our external dataset differed from that used for our internal dataset. It is unclear whether the performance drop in external testing is attributable to dataset shift or to differences in annotation quality. Additionally, this external cohort may not be large enough to evaluate the model's real-world performance. Further investigation on a larger dataset collected from routine clinical practice is needed. Fourth, our current model only operates at two imaging time points. In future work, we hope to develop a unified framework that can process PET/CT images across all time points. Fifth, this study did not investigate how the predicted DSs impacted the Lugano classification, which could be an area of focus for future research. Lastly, we only evaluated our algorithm in cohorts of pediatric HL patients. Whether it is applicable to other diseases or populations requires further investigation.
In conclusion, our study introduced a longitudinally-aware segmentation network to address the challenges of automatic quantification of serial PET scans. The proposed method demonstrated improved lesion detection performance in interim PET scans compared to other methods, without sacrificing segmentation accuracy in baseline PET scans. This highlights the critical role of incorporating longitudinal awareness into AI algorithms for tasks involving analyzing multi-time-point images.

[…] were found for other characteristics. In the external cohort, no significant differences were noted between subgroups for any characteristic.
We then compared the performance between the internal and external cohorts for each subgroup.For interim PET, statistically significant performance differences between the internal and external cohorts were found in the following subgroups: age ≤ 15 years, female, weight ≤ 60 kg, normalized injected dose > 5.5 MBq/kg and Siemens.For baseline PET, the internal performance was consistently higher than the external performance across all subgroups by a significant margin.It is important to note that for the external cohort, there were only 6 scans from non-overlapping scanner models (i.e., these models were only present in the external cohort), which resulted in large error bars.

Figure 1: The architecture of longitudinally-aware segmentation network (LAS-Net). (A) The dual-branch design accommodates baseline (PET1) and interim (PET2) PET/CT images. One branch is dedicated to processing PET1 while the other branch focuses on PET2, using features extracted from PET2 as well as the features from the PET1 branch. (B) The longitudinally-aware window attention (LAWA) module introduces multi-head cross-attention following two self-attention blocks. All attention layers have a window size of 7. (C) The longitudinally-aware attention gate (LAAG) introduces a learnable convolutional layer (kernel size=7) following the standard self-attention gate to refine the attention coefficients for PET2. Both LAWA and LAAG modules only allow one-way information flow from the PET1 to the PET2 branch.

Figure 2: Performance comparison of interim PET lesion detection in the internal cohort. Results are reported with and without mask propagation through deformable registration (MPDR). Notably, CKD-Trans and ST-Trans utilized baseline lesion masks predicted by DynUNet for MPDR. (A) and (B) present the results of detection F1 scores, precision, and recall using different criteria to classify true positives. In (A), a predicted lesion is classified as a true positive if it overlaps with at least one voxel of the reference lesion. In (B), a predicted lesion is considered a true positive if its SUVmax is matched with the reference lesion's SUVmax. (C) quantifies the agreement between model predictions and physician measurements for interim PET metrics. In the plots, actual metric values and Spearman's correlation values are marked by circles with error bars indicating 95% confidence intervals. LAS-Net showed significantly improved performance (P < 0.05) over all comparator methods in F1 scores and interim PET metrics. The only exception was for qPET where its performance did not significantly surpass SegResNet with MPDR (P = 0.057). SUVmax = maximum lesion standardized uptake value, ∆SUVmax = percentage difference of SUVmax between the baseline and interim scans.

Figure 3: Performance comparison of baseline PET lesion segmentation in the internal cohort. (A) shows violin plots of evaluation metrics, where vertical lines represent the interquartile ranges and white circles mark the median values. (B) compares the correlations between baseline PET metrics assessed by physicians and those measured by deep learning models. Actual Spearman's correlation values are marked by circles and their 95% confidence intervals are denoted by error bars. FPV = false positive volume, FNV = false negative volume, MTV = metabolic tumor volume, TLG = total lesion glycolysis, SUVmax = maximum lesion standardized uptake value, Dmax = maximum tumor dissemination, Dspleen = maximum distance between the lesion and the spleen.

Figure 4: Comparison of physician-based and automatically extracted PET metrics. Spearman's ρ correlations are shown in the top left corner of each plot. Correlation values are presented as mean [2.5th percentile, 97.5th percentile].

Figure 5: Nine different examples of longitudinally-aware segmentation network (LAS-Net) output. Each case has maximum intensity projections (MIPs) of baseline and interim PET images with overlaying MIPs of the reference and predicted lesion masks. DS = Deauville score.

Figure 6: Four examples comparing the proposed longitudinally-aware segmentation network (LAS-Net) with DynUNet, a model without longitudinal cross-attention. Each case has maximum intensity projections (MIPs) of baseline and interim PET images, overlaid with MIPs of reference and predicted lesion masks. For both LAS-Net and DynUNet output, results incorporating mask propagation through deformable registration (MPDR) are also included. DS = Deauville score.

Figure E1: Performance comparison of interim PET lesion detection in the internal cohort with the inclusion of equivocal lesions. Results are reported with and without mask propagation through deformable registration (MPDR). Both (A) and (B) show the results of detection F1 scores, precision, and recall but they adopt different criteria for classifying true positives. (A) uses the criterion that a predicted lesion is considered as a true positive if it overlaps with at least one voxel of the reference lesion. (B) uses the criterion that the predicted lesion's SUVmax should be matched with the reference lesion's SUVmax for it to be considered a true positive. In the plots, actual metric values are marked by circles with error bars indicating 95% confidence intervals.

Figure E2: Comparison of physician-based and automatically extracted PET metrics in the external AHOD0831 cohort. Spearman's ρ correlations are shown in the top left corner of each plot. Correlation values are presented as mean [2.5th percentile, 97.5th percentile].

Figure E3: Six examples of longitudinally-aware segmentation network (LAS-Net) output in the external AHOD0831 cohort. Each case has maximum intensity projections (MIPs) of baseline and interim PET images with overlaying MIPs of the reference and predicted lesion masks. Note that lesion-level Deauville scores are not available for the AHOD0831 data. DS = Deauville score.

Figure E4: Performance comparison of interim PET lesion detection in the internal cohort using the criterion that a predicted lesion is classified as a true positive if the Dice coefficient between the predicted lesion and the reference lesion exceeds 0.5.

Figure E5: Internal and external performance across subgroups based on age, sex, weight, normalized injected dose, and scanner models. (A) shows lesion detection performance on interim PET across subgroups, and (B) shows lesion segmentation performance on baseline PET across subgroups. Each metric value is marked by a circle, with error bars indicating the 95% confidence intervals. It is important to note that for the external cohort, there were only 6 scans from non-overlapping scanner models (i.e., these models were only present in the external cohort), which resulted in large error bars.

Table 3: Results of binary classification for adequate/inadequate treatment response using model-predicted Deauville scores.

Table 4: Ablation studies evaluating the effectiveness of each component in LAS-Net for interim lesion detection.